Introduction

In the first part of the course, we will go over some statistical preliminaries and corresponding computational aspects. We will learn:

  1. How to write down a likelihood function
  2. The meaning of the likelihood function
  3. The meaning of the maximum likelihood estimator, and the difference between a parameter, an estimator and an estimate
  4. Good properties of an estimator: consistency, asymptotic normality, unbiasedness, low mean squared error
  5. Frequentist paradigm and quantification of uncertainty
  6. How to use Fisher information for an approximate quantification of uncertainty
  7. Motivation for the Bayesian paradigm
  8. Meaning of the prior distribution
  9. Derivation and meaning of the posterior distribution
  10. Interpretation of a credible interval and a confidence interval
  11. Scope of inference: Should I specify the hypothetical experiment or should I specify the prior distribution? Each one comes with its own scope of inference.

Occupancy studies

Let us start with an occupancy study. Suppose we have a site that is being considered for development. There is a species of interest that might get affected by this development. Hence we need to study what proportion of the area is occupied by the species of interest. If this proportion is not very large, we may go ahead with the development.

Suppose we divide the site into several equal-area cells. Suppose all cells have similar habitats (identical). Further, we assume that occupancy of one cell does not affect occupancy of other cells (independence). Let \(N\) be the total number of cells.

Let \(Y_i\) be the occupancy status of the i-th cell. This is unknown and hence is a random variable. It takes values in \(\{0,1\}\), 0 meaning unoccupied and 1 meaning occupied. This is a Bernoulli random variable. We denote this by \(Y_{i}\sim Bernoulli(\phi)\). The random variable \(Y_{i}\) takes the value 1 with probability \(\phi\). This is the probability of occupancy. The value of \(\phi\) is unknown. This is the parameter of the distribution.

Suppose we visit \(n\) of these cells, a subset selected using simple random sampling without replacement. The observations are denoted by \(y_1,y_2,y_3,\ldots,y_n\). We can use these to infer the unknown parameter \(\phi\). The main tool for such inductive inference (reasoning from data to population, not from hypothesis to prediction) is the likelihood function.

The likelihood function

Suppose the observed data are (0,1,1,0,0). Then we can compute the probability of observing these data under various values of the parameter \(\phi\) (assuming independent, identically distributed Bernoulli random variables). It can be written as: \(L(\phi;y_{1},y_{2},\ldots,y_{n})=\prod_{i=1}^{n}P(y_{i};\phi)=\prod_{i=1}^{n}\phi^{y_{i}}(1-\phi)^{1-y_{i}}\)

Notice that this is a function of the parameter \(\phi\); the data are fixed. The likelihood function, or equivalently the log-likelihood function, quantifies the relative support for different values of the parameter. Hence only ratios of likelihoods are meaningful.
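To make this concrete, here is a small base R sketch (illustrative only; the function name `lik` is not part of the course code) that evaluates this likelihood for the observed data (0,1,1,0,0) at a few candidate values of \(\phi\):

```r
# Observed occupancy data from the example above
y <- c(0, 1, 1, 0, 0)

# Likelihood of iid Bernoulli observations as a function of phi
lik <- function(phi) prod(phi^y * (1 - phi)^(1 - y))

# Relative support for a few candidate parameter values
phi.grid <- c(0.2, 0.4, 0.6, 0.8)
sapply(phi.grid, lik)
# phi = 0.4 gives the highest value on this grid: 0.4^2 * 0.6^3 = 0.03456
```

Plotting the whole curve over (0,1), e.g. with `curve(Vectorize(lik)(x), 0.01, 0.99)`, shows a single peak at the sample mean 0.4.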

Maximum likelihood estimator

A natural approach to estimation (inference) of \(\phi\) is to choose the value that is better supported than any other value in the parameter space \((0,1)\). This is called the maximum likelihood estimator (MLE). We can show that it turns out to be: \(\hat{\phi}=\frac{1}{n}\sum y_{i}\) The value obtained by plugging in the observed data is called an estimate. The estimate is a fixed quantity because the data are observed and hence not random.
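A quick numerical check (illustrative only, not part of the dclone workflow below): maximizing the log-likelihood with `optimize()` recovers the sample mean as the MLE.

```r
y <- c(0, 1, 1, 0, 0)

# Log-likelihood of iid Bernoulli observations
loglik <- function(phi) sum(y * log(phi) + (1 - y) * log(1 - phi))

# Numerical maximization over the parameter space (0, 1)
opt <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
opt$maximum   # approximately 0.4
mean(y)       # the analytic MLE: 2/5 = 0.4
```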

Quantification of uncertainty

As scientists, would you stop at reporting this? I suspect not. If this estimate is large, say 0.85, the developer is going to say ‘you just got lucky (or, worse, you cheated) with your particular sample’. A natural question to ask then is ‘how different would this estimate have been if someone else had conducted the experiment?’. In this case, the ‘experiment to be repeated’ is fairly uncontroversial: we take another simple random sample without replacement from the study area. However, that is not always the case, as we will see when we deal with the regression model.

The sampling distribution is the distribution of the estimates that one would have obtained had one conducted these replicate experiments. It is possible to get an approximation to this sampling distribution in a very general fashion if we use the method of maximum likelihood. In many situations, it can be shown that the sampling distribution is \(\hat{\phi}\sim N(\phi,\frac{1}{n}I^{-1}(\phi))\) where \(I(\phi)=-\frac{1}{n}\sum_{i=1}^{n}\frac{d^{2}}{d\phi^{2}}\log P(y_{i};\phi)=-\frac{1}{n}\frac{d^{2}}{d\phi^{2}}\log L(\phi;y)\)

This is also called the Hessian matrix or the curvature matrix of the log-likelihood function. The higher the curvature, the less variable are the estimates from one experiment to another. Hence the resultant ‘estimate’ is considered highly reliable.
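For the Bernoulli model this curvature has a closed form, \(n/\{\hat{\phi}(1-\hat{\phi})\}\) at the MLE, which we can check against a numerical second derivative (an illustrative sketch; the finite-difference step `h` is an arbitrary choice):

```r
y <- c(0, 1, 1, 0, 0)
n <- length(y)
phi.hat <- mean(y)

loglik <- function(phi) sum(y * log(phi) + (1 - y) * log(1 - phi))

# Negative second derivative of the log-likelihood (the observed
# information), approximated by a central finite difference
h <- 1e-4
obs.info <- -(loglik(phi.hat + h) - 2 * loglik(phi.hat) +
              loglik(phi.hat - h)) / h^2

obs.info                       # roughly 20.83
n / (phi.hat * (1 - phi.hat))  # closed form: 5 / (0.4 * 0.6)
```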

95% Confidence interval

This is just a set of values that covers the estimates from 95% of the experiments. The experiments are not actually replicated and hence this simply tells us what the various outcomes could be. Our decisions could be based on this variation as long as we all agree on the experiment that could be replicated. We are simply covering our bases against the various outcomes and protecting ourselves from future challenges. If we use the maximum likelihood estimator, we can obtain this as: \(\left(\hat{\phi}-1.96\sqrt{\tfrac{1}{n}I^{-1}(\hat{\phi})},\;\hat{\phi}+1.96\sqrt{\tfrac{1}{n}I^{-1}(\hat{\phi})}\right)\) You will notice that as we increase the sample size, the width of this interval converges to zero. That is, as we increase the sample size, the MLE converges to the true parameter value. This is called the ‘consistency’ of an estimator. This is an essential property of any statistical inferential procedure.
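For the Bernoulli example, where \(I^{-1}(\hat{\phi})=\hat{\phi}(1-\hat{\phi})\), the interval can be computed directly (a sketch with illustrative variable names):

```r
y <- c(0, 1, 1, 0, 0)
n <- length(y)
phi.hat <- mean(y)

# Inverse of the per-observation Fisher information at the MLE
inv.info <- phi.hat * (1 - phi.hat)

# 95% Wald confidence interval: phi.hat +/- 1.96 * sqrt(inv.info / n)
se <- sqrt(inv.info / n)
ci <- c(lower = phi.hat - 1.96 * se, upper = phi.hat + 1.96 * se)
ci
```

Note that with such a small sample the lower limit falls below 0; this Wald-type interval is only an asymptotic approximation and is not guaranteed to stay inside the parameter space.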

Bayesian paradigm

All the above statements seem logical but fake at the same time! No one repeats the same experiment (although replication is an essential scientific requirement). What if we have a time series? We can never replicate a time series. So should we simply take the estimated value prima facie? That also seems incorrect scientifically. So where is the uncertainty in our mind coming from? According to the Bayesian paradigm, it arises because of our ‘personal’ uncertainty about the parameter values.

Prior distribution: Suppose we have some idea about what values of occupancy are more likely than others before any data are collected. This can be quantified as a probability distribution on the parameter space (0,1). This distribution can be anything, unimodal or bimodal or even multimodal! Let us denote this by \(\pi(\phi)\). How do we change this after we observe the data?

Posterior distribution: This is the quantification of uncertainty after we observe the data. Usually observing the data decreases our uncertainty, although it is not guaranteed to be the case. The posterior distribution is obtained by: \(\pi(\phi|y)=\frac{L(\phi;y)\pi(\phi)}{\int L(\phi;y)\pi(\phi)\,d\phi}\)
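For the Bernoulli likelihood the integral in the denominator is one-dimensional and easy to do numerically. The following sketch (illustrative only) verifies Bayes' rule against the standard conjugate result that a Beta(1,1) prior combined with \(s=\sum y_i\) successes in \(n\) trials gives a Beta(\(1+s\), \(1+n-s\)) posterior:

```r
y <- c(0, 1, 1, 0, 0)
n <- length(y)
s <- sum(y)

lik   <- function(phi) phi^s * (1 - phi)^(n - s)
prior <- function(phi) dbeta(phi, 1, 1)   # uniform prior on (0, 1)

# Denominator: integral of likelihood times prior over the parameter space
marg <- integrate(function(phi) lik(phi) * prior(phi), 0, 1)$value

# Posterior density by Bayes' rule
post <- function(phi) lik(phi) * prior(phi) / marg

# Agrees with the conjugate Beta(1 + s, 1 + n - s) density
post(0.5)
dbeta(0.5, 1 + s, 1 + n - s)
```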

Credible interval

This is obtained by using the percentiles of the posterior distribution.
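Continuing the conjugate Bernoulli example (with a Beta(1,1) prior the posterior is Beta(\(1+\sum y_i\), \(1+n-\sum y_i\)); code is illustrative), a 95% equal-tailed credible interval is just the 2.5% and 97.5% posterior quantiles:

```r
y <- c(0, 1, 1, 0, 0)
n <- length(y)
s <- sum(y)

# 95% equal-tailed credible interval from the Beta posterior quantiles
cred <- qbeta(c(0.025, 0.975), 1 + s, 1 + n - s)
cred
```

Unlike the Wald confidence interval above, this interval always stays inside the parameter space (0,1).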

Notice a few things here:

  1. This involves an integral in the denominator. Depending on how many parameters (unknowns) are in the model, this can be a large dimensional integral. Imagine a regression model with 5 covariates. This integral will be 6 dimensional (add one for the variance).
  2. Data are fixed. We do not need to replicate the experiment. The uncertainty is completely in the mind of the researcher.
  3. Different researchers might have different prior uncertainties. This will lead to different posterior uncertainties. Hence this is a subjective or personal quantification of uncertainty. It is not transferable from one researcher to another.

An interesting result follows. As we increase the sample size, the Bayesian posterior, for ANY prior, converges to a distribution that looks very much like the frequentist sampling distribution of the MLE. That is, \(\pi(\phi|y)\thickapprox N(\hat{\phi},\frac{1}{n}I^{-1}(\hat{\phi}))\) There are subtle differences that we are going to ignore here. Qualitatively, what this says is that for large sample size:

  1. The posterior mean and the MLE are similar
  2. The posterior variance is similar to \(\frac{1}{n}I^{-1}(\hat{\phi})\), the inverse of the Hessian matrix of the log-likelihood.

Hence the credible interval and the confidence interval will be indistinguishable for large sample sizes, and the effect of the choice of the prior distribution vanishes. How large should the sample size be for this to happen? It depends on the number of parameters in the model and on how strong the prior distribution is.

No math please!

Bayesian and ML inference using MCMC and data cloning

We now show how one can compute the posterior distribution for any choice of the prior distribution without analytically calculating the integral in the denominator. We will generate the data under the Bernoulli model. You can change the parameters as you wish when you run the code.

Simulate a simple data set with 30 observations:

library(dclone)
## Loading required package: coda
## Loading required package: parallel
## Loading required package: Matrix
## dclone 2.3-1      2023-04-07
phi.true = 0.3
n = 30
Y = rbinom(n,1,phi.true)

table(Y)
## Y
##  0  1 
## 20 10

Analytical MLE:

(MLE.est = sum(Y)/n)
## [1] 0.3333333

Bayesian inference

We will use the JAGS program and the dclone R package.

First, we need to define the model function. This is the critical component.

Occ.model = function(){
  # Likelihood 
  for (i in 1:n){
    Y[i] ~ dbin(phi_occ, 1)
  }
  # Prior
  phi_occ ~ dbeta(1, 1)
}

Second, we need to provide the data to the model and generate random numbers from the posterior. We will discuss different options later.

dat = list(Y=Y, n=n)
Occ.Bayes = jags.fit(data=dat, params="phi_occ", model=Occ.model)
## Registered S3 method overwritten by 'R2WinBUGS':
##   method            from  
##   as.mcmc.list.bugs dclone
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 30
##    Unobserved stochastic nodes: 1
##    Total graph size: 33
## 
## Initializing model
summary(Occ.Bayes)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##      0.3426129      0.0811765      0.0006628      0.0008021 
## 
## 2. Quantiles for each variable:
## 
##   2.5%    25%    50%    75%  97.5% 
## 0.1956 0.2840 0.3394 0.3981 0.5087
plot(Occ.Bayes)

The summary describes the posterior distribution: its mean, standard deviation, and quantiles.

This was quite easy. Now we use data cloning to compute the MLE and its variance using MCMC.

Data cloning in a nutshell

As you all know, at least in this simple situation, we can write down the likelihood function analytically. We can also use calculus and/or numerical optimization such as the optim() function in R to get the location of the maximum and its Hessian matrix. But suppose we do not want to go through all of that and instead want to use the MCMC algorithm. Why? Because it is easy and can be generalized to more complex hierarchical models.

Earlier we noted that as we increase the sample size, the Bayesian posterior converges to the sampling distribution of the MLE. We, obviously, cannot increase the sample size; the data are given to us. Data cloning uses a computational trick to increase the sample size: we clone the data!

Imagine a sequence of K independent researchers, each analyzing the same data set and using the posterior distribution obtained by the previous researcher as their prior distribution.

What is happening with these sequential posterior distributions?

The posterior distribution is converging to a single point: a degenerate distribution located at the MLE!

  1. As we increase the number of clones, the mean of the posterior distributions converges to the MLE.
  2. The variance of the posterior distribution converges to 0.
  3. If we scale the posterior distribution by the number of clones (that is, multiply the posterior variance by the number of clones), the result converges to the inverse of the Fisher information matrix.
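These three limits can be seen in the conjugate Bernoulli example without any MCMC: with a Beta(1,1) prior, K clones of a data set with \(s=\sum y_i\) successes give a Beta(\(1+Ks\), \(1+K(n-s)\)) posterior (an illustrative sketch, reusing the small example data):

```r
y <- c(0, 1, 1, 0, 0)
n <- length(y)
s <- sum(y)

# Posterior mean and K-scaled posterior variance for K clones of the data
clone.summary <- function(K) {
  a <- 1 + K * s
  b <- 1 + K * (n - s)
  c(K = K,
    post.mean   = a / (a + b),
    K.times.var = K * a * b / ((a + b)^2 * (a + b + 1)))
}
t(sapply(c(1, 10, 100, 1000), clone.summary))

# As K grows: post.mean -> MLE = s/n = 0.4,
# and K.times.var -> phi.hat * (1 - phi.hat) / n = 0.048
```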

You can play with the number of clones and see the effect on the posterior distribution using this Shiny app (R code for the app)

We do not need to implement this procedure sequentially. The matrix of these K datasets is of dimension (n,K) with identical columns.

\(\left[\begin{array}{cccccccccc} y_{1} & y_{1} & y_{1} & y_{1} & y_{1} & y_{1} & y_{1} & y_{1} & y_{1} & y_{1}\\ y_{2} & y_{2} & y_{2} & y_{2} & y_{2} & y_{2} & y_{2} & y_{2} & y_{2} & y_{2}\\ y_{3} & y_{3} & y_{3} & y_{3} & y_{3} & y_{3} & y_{3} & y_{3} & y_{3} & y_{3}\\ y_{4} & y_{4} & y_{4} & y_{4} & y_{4} & y_{4} & y_{4} & y_{4} & y_{4} & y_{4}\\ y_{5} & y_{5} & y_{5} & y_{5} & y_{5} & y_{5} & y_{5} & y_{5} & y_{5} & y_{5} \end{array}\right]\)

We use the Bayesian procedure to analyze these data. The model function used previously can be used with a minor modification to do this.

Occ.model.dc = function(){
  # Likelihood 
  for(k in 1:ncl){
    for (i in 1:n){
      Y[i,k] ~ dbin(phi_occ, 1)
    }
  }
  # Prior
  phi_occ ~ dbeta(1, 1)
}

To match this change in the model, we need to turn the original data into an array.

Y = array(Y, dim=c(n, 1))
Y = dcdim(Y)

When defining the data, we need to add another index ncl for the cloned dimension. It gets multiplied by the number of clones.

dat = list(Y=Y, n=n, ncl=1)
# 2 clones of the Y array
dclone(Y, 2)
##       clone.1 clone.2
##  [1,]       1       1
##  [2,]       0       0
##  [3,]       0       0
##  [4,]       0       0
##  [5,]       1       1
##  [6,]       0       0
##  [7,]       1       1
##  [8,]       1       1
##  [9,]       0       0
## [10,]       0       0
## [11,]       0       0
## [12,]       0       0
## [13,]       0       0
## [14,]       0       0
## [15,]       0       0
## [16,]       1       1
## [17,]       0       0
## [18,]       1       1
## [19,]       1       1
## [20,]       0       0
## [21,]       1       1
## [22,]       1       1
## [23,]       0       0
## [24,]       0       0
## [25,]       0       0
## [26,]       0       0
## [27,]       0       0
## [28,]       0       0
## [29,]       0       0
## [30,]       1       1
## attr(,"n.clones")
## [1] 2
## attr(,"n.clones")attr(,"method")
## [1] "dim"
## attr(,"n.clones")attr(,"method")attr(,"drop")
## [1] TRUE
# 2 clones of the data list - this is not what we want
dclone(dat, 2)
## $Y
##       clone.1 clone.2
##  [1,]       1       1
##  [2,]       0       0
##  [3,]       0       0
##  [4,]       0       0
##  [5,]       1       1
##  [6,]       0       0
##  [7,]       1       1
##  [8,]       1       1
##  [9,]       0       0
## [10,]       0       0
## [11,]       0       0
## [12,]       0       0
## [13,]       0       0
## [14,]       0       0
## [15,]       0       0
## [16,]       1       1
## [17,]       0       0
## [18,]       1       1
## [19,]       1       1
## [20,]       0       0
## [21,]       1       1
## [22,]       1       1
## [23,]       0       0
## [24,]       0       0
## [25,]       0       0
## [26,]       0       0
## [27,]       0       0
## [28,]       0       0
## [29,]       0       0
## [30,]       1       1
## attr(,"n.clones")
## [1] 2
## attr(,"n.clones")attr(,"method")
## [1] "dim"
## attr(,"n.clones")attr(,"method")attr(,"drop")
## [1] TRUE
## 
## $n
## [1] 30 30
## attr(,"n.clones")
## [1] 2
## attr(,"n.clones")attr(,"method")
## [1] "rep"
## 
## $ncl
## [1] 1 1
## attr(,"n.clones")
## [1] 2
## attr(,"n.clones")attr(,"method")
## [1] "rep"

Notice that this changes n also. We do not want that; we want to keep n unchanged.

dclone(dat, 2, unchanged="n", multiply="ncl")
## $Y
##       clone.1 clone.2
##  [1,]       1       1
##  [2,]       0       0
##  [3,]       0       0
##  [4,]       0       0
##  [5,]       1       1
##  [6,]       0       0
##  [7,]       1       1
##  [8,]       1       1
##  [9,]       0       0
## [10,]       0       0
## [11,]       0       0
## [12,]       0       0
## [13,]       0       0
## [14,]       0       0
## [15,]       0       0
## [16,]       1       1
## [17,]       0       0
## [18,]       1       1
## [19,]       1       1
## [20,]       0       0
## [21,]       1       1
## [22,]       1       1
## [23,]       0       0
## [24,]       0       0
## [25,]       0       0
## [26,]       0       0
## [27,]       0       0
## [28,]       0       0
## [29,]       0       0
## [30,]       1       1
## attr(,"n.clones")
## [1] 2
## attr(,"n.clones")attr(,"method")
## [1] "dim"
## attr(,"n.clones")attr(,"method")attr(,"drop")
## [1] TRUE
## 
## $n
## [1] 30
## 
## $ncl
## [1] 2
## attr(,"n.clones")
## [1] 2
## attr(,"n.clones")attr(,"method")
## [1] "multi"

The dc.fit function takes the familiar arguments to determine how to clone the data list.

Occ.DC = dc.fit(data=dat, params="phi_occ", model=Occ.model.dc,
  n.clones=c(1, 2, 5), unchanged="n", multiply="ncl")
## 
## Fitting model with 1 clone 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 30
##    Unobserved stochastic nodes: 1
##    Total graph size: 34
## 
## Initializing model
## 
## 
## Fitting model with 2 clones 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 60
##    Unobserved stochastic nodes: 1
##    Total graph size: 64
## 
## Initializing model
## 
## 
## Fitting model with 5 clones 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 150
##    Unobserved stochastic nodes: 1
##    Total graph size: 154
## 
## Initializing model
summary(Occ.DC)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## Number of clones = 5
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean      SD   DC SD  Naive SE Time-series SE R hat
## phi_occ 0.3352 0.03816 0.08532 0.0003116      0.0003994     1
## 
## 2. Quantiles for each variable:
## 
##   2.5%    25%    50%    75%  97.5% 
## 0.2623 0.3096 0.3345 0.3607 0.4117
plot(Occ.DC)

Notice the Mean, DC SD, and R hat columns in the summary. These refer to the maximum likelihood estimate, the asymptotic standard error (the SD of the posterior times the square root of K), and the Gelman-Rubin diagnostic:

coef(Occ.DC) # MLE
##   phi_occ 
## 0.3351961
dcsd(Occ.DC) # SE=SD*sqrt(K)
##    phi_occ 
## 0.08532459 
## attr(,"method")
##       Y       n     ncl 
##   "dim"      NA "multi"
gelman.diag(Occ.DC) # R hat
## Potential scale reduction factors:
## 
##         Point est. Upper C.I.
## phi_occ          1          1

Summaries for the clones

Summaries of the posterior distributions for the different numbers of clones are saved and we can print these out with the dctable() command. We can visualize these with the plot function.

dctable(Occ.DC)
## $phi_occ
##   n.clones      mean         sd      2.5%       25%       50%       75%
## 1        1 0.3449931 0.08320118 0.1931794 0.2864487 0.3411990 0.3997247
## 2        2 0.3380204 0.06024657 0.2253682 0.2955537 0.3363799 0.3782251
## 3        5 0.3351961 0.03815832 0.2622565 0.3095898 0.3344979 0.3606665
##       97.5%     r.hat
## 1 0.5179685 1.0002029
## 2 0.4595045 0.9999392
## 3 0.4116999 1.0004954
## 
## attr(,"class")
## [1] "dctable"
dctable(Occ.DC) |> plot()

Some data cloning related diagnostics are printed with the dcdiag() function. We will discuss these statistics in detail. The most important thing to check is that the solid line follows the scattered line for lambda.max, i.e. it decreases with the number of clones.

dcdiag(Occ.DC)
##   n.clones  lambda.max     ms.error    r.squared     r.hat
## 1        1 0.006922436 0.0026932905 0.0006577752 1.0002029
## 2        2 0.003629650 0.0057757427 0.0016336320 0.9999392
## 3        5 0.001456057 0.0006324413 0.0002696127 1.0004954
dcdiag(Occ.DC) |> plot()

Regression models

Now we will generalize these models to account for covariates. We will consider logistic regression but also comment on how to change it easily to probit regression. Similarly, we show how this basic prototype can be modified to do linear and non-linear regression, Poisson regression, etc.

n = 30 # sample size
X1 = rnorm(n) # a covariate
X = model.matrix(~X1)
beta.true = c(0.5, 1)
link_mu = X %*% beta.true # logit scale

Logistic regression model

phi_occ = plogis(link_mu) # prob scale
Y = rbinom(n, 1, phi_occ)

Maximum likelihood estimate using glm():

MLE.est = glm(Y ~ X1, family="binomial")

Bayesian analysis

Occ.model = function(){
  # Likelihood 
  for (i in 1:n){
    phi_occ[i] <- ilogit(X[i,] %*% beta)
    Y[i] ~ dbin(phi_occ[i], 1)
  }
  # Prior
  beta[1] ~ dnorm(0, 1)
  beta[2] ~ dnorm(0, 1)
}

Now we need to provide the data to the model and generate random numbers from the posterior. We will discuss different options later.

dat = list(Y=Y, X=X, n=n)
Occ.Bayes = jags.fit(data=dat, params="beta", model=Occ.model)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 30
##    Unobserved stochastic nodes: 2
##    Total graph size: 186
## 
## Initializing model
summary(Occ.Bayes)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##            Mean     SD Naive SE Time-series SE
## beta[1] -0.1373 0.3839 0.003135       0.004150
## beta[2]  1.1414 0.4994 0.004077       0.005254
## 
## 2. Quantiles for each variable:
## 
##            2.5%     25%    50%    75%  97.5%
## beta[1] -0.9026 -0.3918 -0.136 0.1189 0.6079
## beta[2]  0.2114  0.7971  1.119 1.4632 2.1778
plot(Occ.Bayes)

pairs(Occ.Bayes)

Now we modify this to get the MLE using data cloning.

Occ.model_dc = function(){
  # Likelihood 
  for (k in 1:ncl){
    for (i in 1:n){
      phi_occ[i,k] <- ilogit(X[i,,k] %*% beta)
      Y[i,k] ~ dbin(phi_occ[i,k],1)
    }
  }
  # Prior
  beta[1] ~ dnorm(0, 1)
  beta[2] ~ dnorm(0, 1)
}

Now we need to provide the data to the model and generate random numbers from the posterior. We will discuss different options later.

Y = array(Y, dim=c(n, 1))
X = array(X, dim=c(dim(X), 1))
# clone the objects
Y = dcdim(Y)
X = dcdim(X)

Data cloning with dc.fit():

dat = list(Y=Y, X=X, n=n, ncl=1)
Occ.DC = dc.fit(data=dat, params="beta", model=Occ.model_dc,
  n.clones=c(1, 2, 5), unchanged="n", multiply="ncl")
## 
## Fitting model with 1 clone 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 30
##    Unobserved stochastic nodes: 2
##    Total graph size: 187
## 
## Initializing model
## 
## 
## Fitting model with 2 clones 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 60
##    Unobserved stochastic nodes: 2
##    Total graph size: 307
## 
## Initializing model
## 
## 
## Fitting model with 5 clones 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 150
##    Unobserved stochastic nodes: 2
##    Total graph size: 667
## 
## Initializing model
summary(Occ.DC)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## Number of clones = 5
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean     SD  DC SD Naive SE Time-series SE R hat
## beta[1] -0.155 0.1777 0.3975 0.001451       0.001743     1
## beta[2]  1.350 0.2579 0.5767 0.002106       0.002691     1
## 
## 2. Quantiles for each variable:
## 
##            2.5%     25%     50%      75%  97.5%
## beta[1] -0.5010 -0.2764 -0.1559 -0.03285 0.1901
## beta[2]  0.8591  1.1735  1.3451  1.52043 1.8615
plot(Occ.DC)

pairs(Occ.DC)

These are the familiar functions for the asymptotic ML inference (already accounted for the number of clones):

coef(Occ.DC) # MLE
##    beta[1]    beta[2] 
## -0.1550155  1.3496525
dcsd(Occ.DC) # SE
##  beta[1]  beta[2] 
## 0.397459 0.576738
vcov(Occ.DC) # asymptotic VCV
##              beta[1]      beta[2]
## beta[1]  0.157973685 -0.006228316
## beta[2] -0.006228316  0.332626678
confint(Occ.DC, level=0.95) # 95% CI
##              2.5 %    97.5 %
## beta[1] -0.9340209 0.6239899
## beta[2]  0.2192669 2.4800381

Let’s check the posterior summaries and the DC diagnostics:

dctable(Occ.DC)
## $`beta[1]`
##   n.clones       mean        sd       2.5%        25%        50%         75%
## 1        1 -0.1280467 0.3745953 -0.8691111 -0.3800893 -0.1219638  0.12364566
## 2        2 -0.1484927 0.2789263 -0.6948277 -0.3341926 -0.1478030  0.03680839
## 3        5 -0.1550155 0.1777491 -0.5010395 -0.2764246 -0.1558643 -0.03285093
##       97.5%    r.hat
## 1 0.5952610 1.000780
## 2 0.4013616 1.000626
## 3 0.1900681 1.000130
## 
## $`beta[2]`
##   n.clones     mean        sd      2.5%       25%      50%      75%    97.5%
## 1        1 1.134465 0.4867618 0.2279742 0.8030493 1.116657 1.449563 2.130883
## 2        2 1.261849 0.3865462 0.5281616 0.9970322 1.252060 1.514078 2.040573
## 3        5 1.349652 0.2579251 0.8591293 1.1735045 1.345107 1.520426 1.861508
##      r.hat
## 1 1.001643
## 2 1.000037
## 3 1.000474
## 
## attr(,"class")
## [1] "dctable"
dctable(Occ.DC) |> plot()

dcdiag(Occ.DC)
##   n.clones lambda.max    ms.error   r.squared    r.hat
## 1        1  0.2371100 0.015734348 0.002901883 1.001379
## 2        2  0.1494787 0.007890596 0.001157451 1.000639
## 3        5  0.0665697 0.006034254 0.001506192 1.000306
dcdiag(Occ.DC) |> plot()

We hope you can see the pattern in how we are changing the prototype model function and the data list. If we want to do a normal linear regression or a Poisson regression, we can easily modify the regression program above.

Linear regression

The following section illustrates Gaussian linear regression.

n = 30
X1 = rnorm(n)
X = model.matrix(~X1)
beta.true = c(0.5, 1)
link_mu = X %*% beta.true

# Linear regression model
mu = link_mu
sigma.e = 1
Y = rnorm(n,mean=mu,sd=sigma.e)

# MLE
MLE.est = glm(Y ~ X1, family="gaussian")

# Bayesian analysis
Normal.model = function(){
  # Likelihood 
  for (i in 1:n){
    mu[i] <- X[i,] %*% beta
    Y[i] ~ dnorm(mu[i],prec.e)
  }
  # Prior
  beta[1] ~ dnorm(0, 1)
  beta[2] ~ dnorm(0, 1)
  prec.e ~ dlnorm(0, 1)
}
dat = list(Y=Y, X=X, n=n)
Normal.Bayes = jags.fit(data=dat, params=c("beta","prec.e"), model=Normal.model)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 30
##    Unobserved stochastic nodes: 3
##    Total graph size: 157
## 
## Initializing model
summary(Normal.Bayes)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean     SD Naive SE Time-series SE
## beta[1] 0.5172 0.1716 0.001401       0.001388
## beta[2] 0.8064 0.2355 0.001923       0.001923
## prec.e  1.1683 0.3028 0.002472       0.003728
## 
## 2. Quantiles for each variable:
## 
##           2.5%    25%    50%    75%  97.5%
## beta[1] 0.1771 0.4061 0.5161 0.6321 0.8524
## beta[2] 0.3429 0.6514 0.8089 0.9609 1.2695
## prec.e  0.6612 0.9547 1.1394 1.3556 1.8460
plot(Normal.Bayes)

pairs(Normal.Bayes)

# MLE using data cloning.
Normal.model_dc = function(){
  # Likelihood 
  for (k in 1:ncl){
    for (i in 1:n){
      mu[i,k] <- X[i,,k] %*% beta
      Y[i,k] ~ dnorm(mu[i,k],prec.e)
    }
  }
  # Prior
  beta[1] ~ dnorm(0, 1)
  beta[2] ~ dnorm(0, 1)
  prec.e ~ dlnorm(0, 1)
}
Y = array(Y, dim=c(n, 1))
X = array(X, dim=c(dim(X), 1))
Y = dcdim(Y)
X = dcdim(X)
dat = list(Y=Y, X=X, n=n, ncl=1)
Normal.DC = dc.fit(data=dat, params=c("beta","prec.e"), model=Normal.model_dc,
  n.clones=c(1, 2, 5), unchanged="n", multiply="ncl")
## 
## Fitting model with 1 clone 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 30
##    Unobserved stochastic nodes: 3
##    Total graph size: 158
## 
## Initializing model
## 
## 
## Fitting model with 2 clones 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 60
##    Unobserved stochastic nodes: 3
##    Total graph size: 278
## 
## Initializing model
## 
## 
## Fitting model with 5 clones 
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 150
##    Unobserved stochastic nodes: 3
##    Total graph size: 638
## 
## Initializing model
summary(Normal.DC)
## 
## Iterations = 2001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## Number of clones = 5
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean      SD  DC SD  Naive SE Time-series SE R hat
## beta[1] 0.5277 0.07414 0.1658 0.0006054      0.0006167     1
## beta[2] 0.8511 0.10385 0.2322 0.0008479      0.0008432     1
## prec.e  1.2412 0.14237 0.3183 0.0011624      0.0015376     1
## 
## 2. Quantiles for each variable:
## 
##           2.5%    25%    50%    75%  97.5%
## beta[1] 0.3838 0.4774 0.5278 0.5770 0.6736
## beta[2] 0.6474 0.7814 0.8511 0.9203 1.0567
## prec.e  0.9766 1.1418 1.2362 1.3346 1.5352
plot(Normal.DC)

pairs(Normal.DC)

Why use MCMC based Bayesian and data cloning?

  1. Writing the model function is much more intuitive than writing the likelihood function, prior, etc.
  2. Do not need to do numerical integration or numerical optimization.
  3. Data cloning overcomes multimodality of the likelihood function. The entire prior distribution essentially works as a set of starting values. In the usual optimization, starting values can be quite important when the function is not well behaved. With data cloning, all the local maxima except the global maximum tend to 0.
  4. Asymptotic variance of the MLE is simple to obtain. It is also more stable than computing the inverse of the second derivative of the log-likelihood function numerically.